The objective of this analysis is to understand how various chemical features relate to the quality ratings of red wine. I will start by exploring the data to understand the relationships among the variables, and then try to work out how these features affect wine quality.
So, let’s start exploring the wine data set, which has 1599 observations and 13 columns: a row index (X), 11 explanatory variables describing the chemical properties of the wine, and the quality rating.
Data Set Link : https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv
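The structure output that follows can be reproduced along these lines. This is a minimal sketch rather than the original chunk: the helper name `load_wine` is hypothetical, the data frame name `winedata` is taken from later output in this report, and dropping the `X` column before calling `summary()` is an assumption.

```r
# URL as given in the data set link above
wine_url <- "https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv"

# Hypothetical helper: read the CSV and check that the columns
# used throughout the analysis are present
load_wine <- function(path) {
  wine <- read.csv(path)
  stopifnot(all(c("alcohol", "quality") %in% names(wine)))
  wine
}

# Assumed usage (needs network access, so it is left commented out):
# winedata <- load_wine(wine_url)
# str(winedata)
# summary(winedata[, names(winedata) != "X"])  # assumption: skip the row index
```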
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Observations from the Summary
In order to explore this data visually, let’s create some visualizations.
The following inferences can be drawn from the above plots.
Let’s rescale these variables to bring them closer to a normal distribution. Skewed and long-tailed data can be transformed by taking the square root or the logarithm. Here, I will apply a log transformation to the skewed and long-tailed distributions.
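The transformation can be sketched as follows, using only the first ten values shown in the `str()` output. The choice of `log10` over the natural log is an assumption, since the report does not say which base was used.

```r
# First ten values of two skewed variables, copied from the str() output
citric.acid    <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# log10 compresses the long right tail of a strictly positive variable
log_sugar <- log10(residual.sugar)

# log10(0) is -Inf, so zero values cannot be shown on a log scale
log_citric <- log10(citric.acid)
n_zero <- sum(is.infinite(log_citric))

# The square root is the alternative mentioned above; it is defined at zero
sqrt_citric <- sqrt(citric.acid)
```

Variables with many zeros lose those observations on a log scale, which is worth keeping in mind when reading the transformed histograms.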
For fixed acidity and volatile acidity, the distributions look almost normal after the log transformation. The volatile acidity distribution also looks slightly bimodal.
The citric acid distribution is not normal even after the log transformation. Citric acid has a lot of zero values, and the majority of its values fall between 0.2 and 0.8.
Chlorides, total sulfur dioxide and sulphates appear to be normally distributed after the logarithmic transformation.
Residual sugar looks almost normal after the log transformation.
Alcohol and free sulfur dioxide appear to be bimodal.
There are 1599 observations of red wines in the dataset with 12 features: 11 numerical input variables and the quality rating, which I treat as a categorical variable. There are no NA values in the dataset.
The main feature of interest in the data is quality, which is the output variable. The objective is to determine the relationship between the other explanatory variables and quality.
Variables such as fixed.acidity, volatile.acidity, citric.acid and alcohol content look like the main predictors of wine quality. These variables may support my investigation; however, I expect to gain more insight once I draw the bivariate plots.
I created new variables from quality (the output variable): I converted it to a factor, and I also created a quality rating bucket that groups quality into poor, good and excellent.
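A possible way to build these two variables is sketched below. The cut points are an assumption: the report only states later that ratings 5 and 6 count as good, so anything below 5 is treated as poor and anything above 6 as excellent.

```r
# First ten quality scores, copied from the str() output
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Quality as an ordered factor
quality_factor <- factor(quality, ordered = TRUE)

# Assumed buckets: (0,4] Poor, (4,6] Good, (6,10] Excellent
quality_rating <- cut(quality,
                      breaks = c(0, 4, 6, 10),
                      labels = c("Poor", "Good", "Excellent"))

# Number of wines per bucket
rating_counts <- table(quality_rating)
```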
I noticed that the distribution of citric acid is unusual. Even after applying the log transformation, it is not normal.
Aside from this, the distributions of some other variables, such as volatile acidity, alcohol and free sulfur dioxide, seem to be bimodal.
Some of the distributions were affected by outliers, so I applied the log transformation to them, after which they look close to normal.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
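A matrix like the one above is what `cor()` returns for a numeric data frame. The sketch below assumes Pearson correlation (the `cor()` default) and uses only the first ten rows of three columns from the `str()` output, so its numbers will not match the full-data matrix.

```r
# First ten rows of three columns, copied from the str() output
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(wine_head, method = "pearson")

# Rounded copy for easier reading
cor_rounded <- round(cor_matrix, 3)
```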
So, let’s use ggplot to further examine the variables that are most strongly correlated with each other.
Based on the correlation matrix and the ggpairs plots, there doesn’t seem to be a strong correlation between any two variables. However, some variables are moderately correlated with each other. Let’s examine the relationships between those variables using bivariate plots.
The top four variables correlated with quality are alcohol (0.476), sulphates (0.251), citric.acid (0.226) and volatile.acidity, which is negatively correlated (-0.391).
Fixed acidity seems to be correlated with citric acid, density and pH (negatively correlated).
Sulphates and Chlorides seem to be moderately positively correlated.
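The ranking of variables by their correlation with quality can be checked directly from the quality column of the correlation matrix. The sketch below copies those values from the matrix output; the vector name `quality_cor` is only illustrative.

```r
# Correlation of each variable with quality, copied from the matrix above
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by absolute value so that strong negative correlations are not missed
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
top_four <- head(ranked, 4)
```

Ranking by absolute value puts volatile.acidity second, ahead of sulphates and citric.acid, even though its sign is negative.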
So, we can infer from the above plots that the quality rating goes up with increased alcohol content. This is especially true for excellent quality wine.
There seems to be a moderate negative correlation between alcohol and density (-0.496), so a wine with higher alcohol content tends to have lower density.
pH and Fixed.acidity have a strong negative correlation between them.
There seems to be a moderate correlation between volatile acidity and quality. Red wines with volatile acidity of less than 0.4 tend to have excellent quality.
There are a lot of outliers in the data. So, looking at the plot, it seems that these two variables don’t have a very strong relationship.
total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated, but they are not among our main features of interest.
Sulphates and Chlorides seem to be moderately positively correlated.
pH and density have a weak negative correlation, so when density increases, pH tends to decrease.
The strongest relationship is between fixed.acidity and pH.
It seems that a rise in citric acid and fixed acidity does not have a significant impact on wine quality.
It seems that wines with lower density and higher alcohol content tend to be of better quality.
From the above plot, it seems that lower pH and more alcohol make for a better wine.
No models were created in this analysis.
The wine quality ratings appear to be roughly normally distributed. However, we can also see that around 80% of the data belongs to red wines rated 5 or 6, i.e. good quality wines under the criteria stated above. So, the data is biased towards good quality wine, and poor and excellent quality wines are not well represented in the sample. We could also change the criteria used to define poor, good and excellent quality wines, but that needs further investigation of the dataset.
## winedata$quality_rating: Poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## winedata$quality_rating: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## winedata$quality_rating: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
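The layout of the output above matches what `by()` prints for a numeric vector split by a factor. The sketch below is an assumption about how it was produced, run on the first ten rows from the `str()` output only; those rows contain no poor wines, so that group is empty here, and the bucket cut points are assumed as well.

```r
# First ten rows of alcohol and quality, copied from the str() output
winedata <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed buckets: below 5 Poor, 5 to 6 Good, above 6 Excellent
winedata$quality_rating <- cut(winedata$quality,
                               breaks = c(0, 4, 6, 10),
                               labels = c("Poor", "Good", "Excellent"))

# Five-number summary plus mean of alcohol within each bucket
alcohol_by_rating <- by(winedata$alcohol, winedata$quality_rating, summary)

# Group means only; empty groups come back as NA
mean_by_rating <- tapply(winedata$alcohol, winedata$quality_rating, mean)
```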
The above is a box plot of alcohol by quality rating. Alcohol has the strongest correlation with quality, at around 0.476. High quality wines appear to have higher alcohol content on average, as reflected in the box plot.
From the above summary statistics, we can infer that the average alcohol content for excellent quality wine is about 11.5%, while good and poor quality wines average 10.25% and 10.22% respectively. The box plot also shows that there is not much difference in alcohol content between poor and good quality wines, although there seem to be many outliers among the good quality wines.
The above plot describes the effect of alcohol percentage and wine density on wine quality. The higher the alcohol percentage, the lower the density. This visualization also supports our earlier hypothesis that higher alcohol content and lower density go with better quality wines.
I have limited experience with R, so this analysis was challenging for me, but at the same time it was quite rewarding: it gave me the opportunity to explore the entire wine data set and to create visualizations that reveal patterns in the data.
Through this exploratory data analysis, I was able to identify key factors, such as alcohol content, sulphates and acidity, that contribute to wine quality.
In the beginning, I had no idea that alcohol content has more influence on the quality of wine than the other parameters, but the univariate, bivariate and multivariate analyses led me to this insight, which I found surprising.
Had I had more time, I would have fitted a regression model to the data in order to gain more insight into wine quality and its relationship with the other variables.